The microprocessor industry is currently struggling with higher development costs and 
longer design times that arise from exceedingly complex processors that are pushing the 
limits of instruction-level parallelism. Meanwhile, such designs are especially ill suited for 
important commercial applications, such as on-line transaction processing (OLTP), which 
suffer from large memory stall times and exhibit little instruction-level parallelism. Given 
that commercial applications constitute by fa ... 

2 Smart Memories: a modular reconfigurable architecture 

Ken Mai, Tim Paaske, Nuwan Jayasena, Ron Ho, William J. Dally, Mark Horowitz 

May 2000 ACM SIGARCH Computer Architecture News, Proceedings of the 27th annual international symposium on Computer architecture, Volume 28 Issue 2 
Trends in VLSI technology scaling demand that future computing devices be narrowly 
focused to achieve high performance and high efficiency, yet also target the high volumes 
and low costs of widely applicable general purpose designs. To address these conflicting 
requirements, we propose a modular reconfigurable architecture called Smart Memories, 
targeted at computing needs in the 0.1 μm technology generation. A Smart Memories 
chip is made up of many processing tiles, each containing local ... 
November 2000 Proceedings of the ninth international conference on Architectural 
Symmetric multiprocessor (SMP) servers provide superior performance for the commercial 
workloads that dominate the Internet. Our simulation results show that over one-third of 
cache misses by these applications result in cache-to-cache transfers, where the data is 
found in another processor's cache rather than in memory. SMPs are optimized for this case 
by using snooping protocols that broadcast address transactions to all processors. 
Conversely, directory-based shared-memory systems must indire ... 

5 A comparative study, of arbitrate 

Shubhendu S. Mukherjee, Federico Silla, Peter Bannon, Joel Emer, Steve Lang, David Webb 
October 2002 Proceedings of the 10th international conference on Architectural support for programming languages and operating systems, Volume 30, 36, 37 Issue 5, 5, 10 


Interconnection networks usually consist of a fabric of interconnected routers, which receive 
packets arriving at their input ports and forward them to appropriate output ports. 
Unfortunately, network packets moving through these routers are often delayed due to 
conflicting demand for resources, such as output ports or buffer space. Hence, routers 
typically employ arbiters that resolve conflicting resource demands to maximize the number 
of matches between packets waiting at input ports an ... 

6 The StanM 

J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, J. Hennessy 

April 1994 ACM SIGARCH Computer Architecture News, Proceedings of the 21st annual international symposium on Computer architecture, Volume 22 Issue 2 
The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory 
and high-performance message passing, while minimizing both hardware and software 
overhead. Each node in FLASH contains a microprocessor, a portion of the machine's global 
memory, a port to the interconnection network, an I/O interface, and a custom node 
controller called MAGIC. The MAGIC chip handles all communication both within the node 
and among nodes, using hardwired data paths for efficient data moveme ... 

7 Using dataflow analysis techniques to reduce ownership overhead in cache coherence protocols 

Jonas Skeppstedt, Per Stenstrom 

November 1996 ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 18 Issue 6 
In this article, we explore the potential of classical dataflow analysis techniques in removing 
overhead in write-invalidate cache coherence protocols for shared-memory multiprocessors. 
We construct the compiler algorithms with varying degree of sophistication that detect loads 
followed by stores to the same address. Such loads are marked and constitute a hint to the 
cache to obtain an exclusive copy of the block so that the subsequent store does not 
introduce access penalties. The simplest ... 
David F. Bacon, Susan L. Graham, Oliver J. Sharp 

December 1994 ACM Computing Surveys (CSUR), volume 26 issue 4 

In the last three decades a large number of compiler transformations for optimizing 
programs have been implemented. Most optimizations for uniprocessors reduce the number 
of instructions executed by the program using transformations based on the analysis of 
scalar quantities and data-flow techniques. In contrast, optimizations for high-performance 
superscalar, vector, and parallel processors maximize parallelism and memory locality with 
transformations that rely on tracking the properties o ... 

Keywords: compilation, dependence analysis, locality, multiprocessors, optimization, 
parallelism, superscalar processors, vectorization 
Ravi Rajwar, Alain Kagi, James R. Goodman 

June 2003 Proceedings of the 17th annual international conference on 
Supercomputing 


Communication latencies within critical sections constitute a major bottleneck in some 
classes of emerging parallel workloads. In this paper, we argue for the use of Inferentially 
Queued Locks (IQLs) [31], not just for efficient synchronization but also for reducing 
communication latencies, and we propose a novel mechanism, Speculative Push (SP), aimed 
at reducing these communication latencies. With IQLs, the processor infers the existence, 
and limits, of a critical section from the use of synch ... 

Keywords: data forwarding, inferential queueing, synchronization 
June 1998 Proceedings of the tenth annual ACM symposium on Parallel algorithms and 
This paper compares eight commercial parallel processors along several dimensions. The 
processors include four shared-bus multiprocessors (the Encore Multimax, the Sequent 
Balance system, the Alliant FX series, and the ELXSI System 6400) and four network 
multiprocessors (the BBN Butterfly, the NCUBE, the Intel iPSC/2, and the FPS T Series). The 
paper contrasts the computers from the standpoint of interconnection structures, memory 
configurations, and interprocessor communication. Also, the share ... 

13 Quantifying instruction criticality for shared memory multiprocessors 
Tong Li, Alvin R. Lebeck, Daniel J. Sorin 

June 2003 Proceedings of the fifteenth annual ACM symposium on Parallel algorithms 
and architectures 


Recent research on processor microarchitecture suggests using instruction criticality as a 
metric to guide hardware control policies. Fields et al. [3, 4] have proposed a directed 
acyclic graph (DAG) model for characterizing program microexecutions on uniprocessors. 
Under such a model, critical path analysis can be applied and instructions' slack values can 
be used to quantify instruction criticality. In this paper, we extend the uniprocessor DAG 
model to characterize parallel program executions ... 

Keywords: critical path analysis, shared memory multiprocessors, slack 



14 Token coherence: decoupling performance and correctness 

Milo M. K. Martin, Mark D. Hill, David A. Wood 

May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th annual international symposium on Computer architecture, Volume 31 Issue 2 

Many future shared-memory multiprocessor servers will both target commercial workloads 
and use highly-integrated "glueless" designs. Implementing low-latency cache coherence in 
these systems is difficult, because traditional approaches either add indirection for common 
cache-to-cache misses (directory protocols) or require a totally-ordered interconnect 
(traditional snooping protocols). Unfortunately, totally-ordered interconnects are difficult to 
implement in glueless designs. An ideal coherenc ... 

15 Memory sharing predictor: the key to a speculative coherent DSM 
An-Chow Lai, Babak Falsafi 

May 1999 ACM SIGARCH Computer Architecture News, Proceedings of the 26th annual international symposium on Computer architecture, Volume 27 Issue 2 

Recent research advocates using general message predictors to learn and predict the 
coherence activity in distributed shared memory (DSM). By accurately predicting a message 
and timely invoking the necessary coherence actions, a DSM can hide much of the remote 
access latency. This paper proposes the Memory Sharing Predictors (MSPs), pattern-based 
predictors that significantly improve prediction accuracy and implementation cost over 
general message predictors. An MSP is based on the key ob ... 

16 Transactional lock-free execution of lock-based programs 
Ravi Rajwar, James R. Goodman 

October 2002 Proceedings of the 10th international conference on Architectural support for programming languages and operating systems, Volume 36, 30, 37 Issue 5, 5, 10 


This paper is motivated by the difficulty in writing correct high-performance programs. 
Writing shared-memory multi-threaded programs imposes a complex trade-off between 
programming ease and performance, largely due to subtleties in coordinating access to 
shared data. To ensure correctness programmers often rely on conservative locking at the 
expense of performance. The resulting serialization of threads is a performance bottleneck. 
Locks also interact poorly with thread scheduling and faults, r ... 

17 Architecture and design of AlphaServer GS320 
Kourosh Gharachorloo, Madhu Sharma, Simon Steely, Stephen Van Doren 
November 2000 Proceedings of the ninth international conference on Architectural support for programming languages and operating systems, Volume 28, 34 Issue 5, 5 
This paper describes the architecture and implementation of the AlphaServer GS320, a 
cache-coherent non-uniform memory access multiprocessor developed at Compaq. The 
AlphaServer GS320 architecture is specifically targeted at medium-scale multiprocessing 
with 32 to 64 processors. Each node in the design consists of four Alpha 21264 processors, 
up to 32GB of coherent memory, and an aggressive IO subsystem. The current 
implementation supports up to 8 such nodes for a total of 32 processors. While s ... 

18 Architecture and design of AlphaServer GS320 
Kourosh Gharachorloo, Madhu Sharma, Simon Steely, Stephen Van Doren 
November 2000 ACM SIGPLAN Notices, Volume 35 Issue 11 
This paper describes the architecture and implementation of the AlphaServer GS320, a 
cache-coherent non-uniform memory access multiprocessor developed at Compaq. The 
AlphaServer GS320 architecture is specifically targeted at medium-scale multiprocessing 
with 32 to 64 processors. Each node in the design consists of four Alpha 21264 processors, 
up to 32GB of coherent memory, and an aggressive IO subsystem. The current 
implementation supports up to 8 such nodes for a total of 32 processors. While s ... 

19 SafetyNet | 
chec^ 

Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood 
May 2002 ACM SIGARCH Computer Architecture News, Volume 30 Issue 2 


We develop an availability solution, called SafetyNet, that uses a unified, lightweight 
checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At 
an abstract level, SafetyNet logically maintains multiple, globally consistent checkpoints of 
the state of a shared memory multiprocessor (i.e., processors, memory, and coherence 
permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a 
fault is detected. SafetyNet e ... 

Keywords: availability, shared memory, multiprocessor 



20 Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors 

Milo M. K. Martin, Pacia J. Harper, Daniel J. Sorin, Mark D. Hill, David A. Wood 

May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th annual international symposium on Computer architecture, Volume 31 Issue 2 

Destination-set prediction can improve the latency/bandwidth tradeoff in shared-memory 
multiprocessors. The destination set is the collection of processors that receive a particular 
coherence request. Snooping protocols send requests to the maximal destination set (i.e., 
all processors), reducing latency for cache-to-cache misses at the expense of increased 
traffic. Directory protocols send requests to the minimal destination set, reducing bandwidth 